A robust category guesser for Dutch medical language
Abstract
In this paper, we describe the architecture and some of the implementation issues of a large-scale category guesser for Dutch medical vocabulary. We also provide numerical data on the precision and coverage of this category guesser, which for the moment has to cover only the vocabulary of the cardiology domain. The category guesser uses non-morphologic information (endstring matching) as well as truly morphologic knowledge (inflection, derivation and compounding). Since we deal with a sublanguage, some linguistic features are easier to handle (Grishman and Kittredge, 1986), (Sager et al., 1987). We then describe in detail the different parts which interact to successfully identify unknown medical words.

1 Introduction

1.1 NLP in medicine

Medical patient reports consist mainly of free text, combined with results of various laboratories. While numerical data can easily be stored and processed for archiving and research purposes, free text is rather difficult for a computer to process, although it contains the most relevant information. Several authors put forward the hypothesis that Natural Language Processing (NLP) and Knowledge Representation (KR) of medical discharge summaries have become the key issues in the domain of intelligent medical information processing (Baud et al., 1992), (Gabrieli and Speth, 1987), (McCray, 1991). However, only a few NLP-driven systems have actually been implemented (Friedman and Johnson, 1992). For Dutch, a limited prototype has been developed (Spyns, 1991), (Spyns and Adriaens, 1992). A broader system covering a larger part of the Dutch grammar and medical vocabulary is currently under development. This activity forms part of the MENELAS-project¹. This project comprises a morphological, syntactic, semantic and pragmatic analysis of the medical sublanguage for Dutch, English and French (Spyns et al., 1992).
The project also focuses on Knowledge Representation (by means of Conceptual Graphs) (Sowa, 1984), (Volot et al., 1993) and Production Systems (Bouaud and Zweigenbaum, 1992).

1.2 The Category Guesser for Dutch Medical Language

This paper focuses on the morphological and lexical component of the system, which is a combination of a database application and a Prolog rule interpreter. This component is already functioning and is used continuously during the current extension of the coverage of the Dutch grammar (Spyns et al., 1993). The importance of morphologic analysis of medical vocabulary has been widely recognised (Wingert, 1985), (Wolff, 1984), (Dujols et al., 1991), (Pacak and Pratt, 1969), (Pacak and Pratt, 1978), (Norton, 1983).

In the following sections, we describe the different parts which interact to identify the word forms of a given sentence, along with the various stages of their analysis. A major distinction can be made between forms "known by the system" (= stored in the dictionary, cf. section 2) and unknown forms, whose linguistic characteristics need to be computed and are hypothetical. The latter can be based on morphologic knowledge (section 3) or other heuristics (sections 4, 5 & 6). Each section is illustrated by an example or some implementation details. A schematic overview of the architecture of the category guesser is presented in section 7. The subsequent section (8) is devoted to the evaluation, which will guide the further elaboration of the category guesser described here. The paper ends with a conclusion and discussion (section 9).

¹ The MENELAS-project (AIM #2023) is financed by the Directorate General XIII of the European Community (Zweigenbaum and others, 1991).
  [lex:geprobeerd,nllu:geprobeerd,cat:n,nb:sing,pers:3]
  [lex:geprobeerd,nllu:geprobeerd,cat:adj,adjtype:ord,adj~:no]
  [lex:geprobeerd,nllu:proberen,cat:v,pers:nil,nb:nil,tense:nil,vform:pastpart]
  [lex:geprobeerd,nllu:proberen,cat:adj,adjtype:papa,adj~:no]
  [lex:geprobeerd,nllu:geprobeerden,cat:v,pers:nil,nb:nil,tense:nil,vform:pastpart]
  [lex:geprobeerd,nllu:geprobeerden,cat:adj,adjtype:papa,adj~:no]
  [lex:geprobeerd,nllu:geproberen,cat:adj,adjtype:papa,adj~:no]
  [lex:geprobeerd,nllu:geprobeerden,cat:v,pers:1,nb:sing,tense:pres,vform:finite]

Figure 1: Example of Cohort for "geprobeerd"

2 Full Form Dictionary

The lexical database for Dutch was built using several resources: an existing electronic valency dictionary and a list of words extracted from a medical corpus (cardiology patient discharge summaries). The already existing electronic dictionary (resulting from the K.U. Leuven PROTON-project (Dehaspe and Van Langendonck, )) and the newly coded entries were converted and merged into a common representation in a relational database (Dehaspe, 1993).

The intention is to use the category guesser (cf. infra) as little as possible. To that end, the dictionary is conceived as a full-form dictionary. Currently, there are some 100,000 full forms in the lexical database (corresponding to some 8,000 non-inflected forms). However, since an exhaustive dictionary is an unrealistic assumption, a category guesser handles all the unknown word forms. The unknown words trigger a set of rules to identify the surface form, to attribute syntactic categories to it, and to calculate the possible canonical form(s). The category guesser can also enhance the robustness of the larger NLP system, since misspelled words can receive, to a certain extent, correct syntactic features.
To reach this aim, the category guesser combines morphologic knowledge (section 3) as well as non-morphologic knowledge (sections 4 & 6).

3 Morphological Analysis

3.1 Preliminary Remarks

The morphological analyser consists mainly of three sections, which correspond more or less to the three linguistic operations on words: inflection, derivation and compounding. However, from an implementational point of view, the boundaries between derivation and compounding are defined in a different way. The compounds created by agglutination or combined by means of a hyphen are computationally treated as non-compounds. This implies that the same segmentation routine can be used for the computation of derivations and monolithic compounds (Spyns and De Wachter, 1995).

3.2 Inflection

The inflection analyser produces one or more bundles of morphosyntactic feature-value pairs for each submitted surface form; together, these bundles form the cohort. The generated feature bundles comprise, among other features, the surface form (lex), the supposed canonical form (nllu) as well as its category (cat)². A reduced example of the cohort produced for "geprobeerd" (Eng.: "tried") is given in figure 1.

The initial cohort will later on be reduced as much as possible (the ideal result in most cases being a single feature bundle). To this end, a cascading priority system has been defined. The attribute "morf" expresses the quality of the analysis, possible values being segm, suffix, string or guess, with segm > suffix > string > guess. More details on this will be given below. Only the feature bundles of supposed nouns, verbs, adjectives and adverbs (i.e. the open categories) are admitted in the initial set of hypothetical analyses, or cohort.

3.3 Segmentation

Derivation and monolithic compounding are used to try to identify as many as possible of the canonical forms computed by the inflectional analyser.
The starting principle here is that the right part of the computed canonical form usually constitutes the grammatical head of the whole word. The whole word thus inherits the feature bundle associated with its right part (Selkirk, 1982, p.150)³. Unlike Williams (1983) and Selkirk (1982), we do not allow inflectional suffixes to be heads. The right part can be found in the dictionary (monolithic compounding) or in a list of suffixes (derivation). In the current segmentation program, the major part of this list contains medical suffixes, which constitute a clearly definable set that is fairly regular in its (morphological and syntactic) behaviour (Dujols et al., 1991). An extract of the suffix list is given in figure 2.

² v for verbs, adj for adjectives and n for nouns; others are nb [sing or plur] for number, pers [1, 2, 3 or nil] for person.

³ We are fully aware that linguistic reality is more complex: e.g. some derivations (such as Dutch diminutives, cf. (Ritchie et al., 1992)) are regarded as left-headed. Maybe they should be treated computationally by the inflectional analyser.

  suffix([a,r,i,s], [cat:adj, nb:sing]).
  suffix([a,a,l], [cat:adj, nb:sing]).
  suffix([i,e], [cat:n, nb:sing]).

Figure 2: Examples of Suffixes with Feature Bundle

The computed canonical form is scanned and segmented from right to left. All possible solutions are generated by a failure-driven loop (no exclusive longest-match principle). The segmentation routine tries to identify a right part (head:dict or head:suffix) and then tries to recognize the remaining left part. If this succeeds, the segmentation is complete (morf:segm). Otherwise, it is only partial (morf:suffix). At the moment, only noun-noun compounds are treated. Many medical noun-noun compounds combine a medical non-head part with a non-medical head part (e.g. hartziekte, Eng.: heart disease).
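The right-to-left scan described above can be pictured with a minimal sketch. It is written in Python rather than the Prolog of the actual system, and it rests on several assumptions that are not in the paper: the names `SUFFIXES`, `DICTIONARY` and `segment` are hypothetical, the two dictionary entries are invented for illustration, and "recognizing the left part" is reduced to a plain dictionary lookup. The suffix entries are those of figure 2.

```python
from typing import Dict, List

# Suffix entries follow Figure 2; the dictionary entries are hypothetical.
SUFFIXES: Dict[str, Dict[str, str]] = {
    "aris": {"cat": "adj", "nb": "sing"},
    "aal": {"cat": "adj", "nb": "sing"},
    "ie": {"cat": "n", "nb": "sing"},
}
DICTIONARY: Dict[str, Dict[str, str]] = {
    "hart": {"cat": "n", "nb": "sing"},
    "ziekte": {"cat": "n", "nb": "sing"},
}


def segment(form: str) -> List[dict]:
    """Return every right-headed split of `form`, not only the longest match."""
    analyses = []
    # Move the split point from the right edge of the word towards the left.
    for cut in range(len(form) - 1, 0, -1):
        left, right = form[:cut], form[cut:]
        for source, table in (("dict", DICTIONARY), ("suffix", SUFFIXES)):
            if right not in table:
                continue
            analyses.append({
                "left": left,
                "right": right,
                "head": source,
                # Assumption: the left part counts as recognized only if it
                # is itself a dictionary entry; otherwise the split is partial.
                "morf": "segm" if left in DICTIONARY else "suffix",
                # The whole word inherits the features of its right part.
                **table[right],
            })
    return analyses
```

Under these assumptions, `segment("hartziekte")` yields one complete analysis headed by the dictionary entry "ziekte", while a form such as "operatie" (a hypothetical input) yields only a partial analysis headed by the suffix "ie". Collecting all splits in a list stands in for the failure-driven loop; no attempt is made to restrict compounds to noun-noun combinations.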
Only those feature bundles of the cohort are kept that are compatible (by means of graph unification) with the feature bundle associated with the head part (suffix or dictionary entry). At this stage of filtering, the feature cat (syntactic category) plays the most prominent role.

4 Endstring Matching

When nothing can be predicted by means of morphology, another heuristic is applied to reduce the set of remaining possible morphological analyses. This stage focuses more on general-language words. It is based on a series of endstrings (not limited by morphological boundaries) which determine the category of a word. Only the open syntactic classes are taken into account (noun, verb, adjective and adverb). Some endstrings uniquely identify the category of a word, while others are more equivocal. The latter are correlated with two or even three categories. The linguistic knowledge needed to build a list of non-inflected endstrings and their associated category (or categories) was found in Lemmens (1989). Some combinations of an endstring and its category are shown in figure 3.

  end([d,r,e,e]-3, [v,adj], [eerd]).
  end([l,e,e,i]-3, [adj,n], [ieel]).
  end([l,e,i]-3, [adj], [iel]).
  end([e,m,s,i]-3, [n], [isme]).

Figure 3: Some endstring-category combinations

When a computed lexical form is presented to the endstring matcher, the above-mentioned list is checked to see if an endstring constitutes the end part of the submitted word. In fact, the surface form as well as the hypothetical canonical form of the feature bundle are submitted to the endstring matcher. Only the categories resulting from both matching processes (= the intersection) are finally retained. Subsequently, the feature bundle(s) of the cohort containing the proposed syntactic category are extended with an extra feature-value pair (morf:string).
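The intersection step can be illustrated with a small sketch, again in Python rather than Prolog and under stated assumptions. The first four endstrings come from figure 3; the entry "eren" is hypothetical and is added only so that the canonical form "proberen" matches something. The names `ENDSTRINGS`, `match_endstring` and `apply_endstrings` are invented, and choosing the longest matching endstring is an assumed policy that the paper does not state. The meaning of the "-3" annotation in figure 3 is not modelled.

```python
from typing import Dict, List, Set

# Endstring -> candidate categories. The first four follow Figure 3;
# "eren" is a hypothetical entry added for this illustration.
ENDSTRINGS: Dict[str, Set[str]] = {
    "eerd": {"v", "adj"},
    "ieel": {"adj", "n"},
    "iel": {"adj"},
    "isme": {"n"},
    "eren": {"v"},
}


def match_endstring(form: str) -> Set[str]:
    """Categories of the longest listed endstring ending `form` (assumed policy)."""
    hits = [end for end in ENDSTRINGS if form.endswith(end)]
    return set(ENDSTRINGS[max(hits, key=len)]) if hits else set()


def apply_endstrings(cohort: List[dict]) -> List[dict]:
    """Keep bundles whose category survives both matches; tag them morf:string."""
    validated = []
    for bundle in cohort:
        # Match the surface form and the canonical form, then intersect.
        allowed = match_endstring(bundle["lex"]) & match_endstring(bundle["nllu"])
        if bundle["cat"] in allowed:
            validated.append({**bundle, "morf": "string"})
    return validated
```

Applied to a reduced cohort for "geprobeerd", this sketch keeps only the verb bundle with canonical form "proberen", because "eerd" allows v and adj while the hypothetical "eren" allows only v. That outcome depends entirely on the invented "eren" entry, so it should not be read as the result the actual system produces.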
Below (see figure 4) the result of endstring matching applied to the verb "geprobeerd" (Eng.: "tried") is shown (the rule with ending -eerd applies)⁴. The inflection rules were able to produce a canonical form together with its category which the endstring matcher considers correct. This implies that the inflection rule was correctly triggered and applied. As a corollary, the other syntactic information in such a validated feature bundle (with morf:string) is supposed to be correct as well. However, many syntactic features are underspecified⁵.

5 Default or Catch All Rule

If none of the aforementioned cases apply, the computed canonical forms and their corresponding grammatical features are pure guesses. The complete cohort is retained and each of its feature bundles is extended with one extra feature, morf:guess.

6 Final selection of the set of